Analysis of Categorical Data


Introduction to categorical data

Many experiments result in measurements that are qualitative (categorical) rather than quantitative (numbers).

Example

 

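We can represent each outcome as a length-$k$ vector $Y_i = (Y_{i1}, \dots, Y_{ik})$ such that $Y_{ij} = 1$ if the $i$th measurement belongs to the $j$th category and $Y_{ij} = 0$ otherwise, for $j = 1, \dots, k$.

Note that the sample can be summarized by the counts of the measurements that fall into each of the $k$ categories, i.e., by the random vector $X = (X_1, \dots, X_k)$ such that $X_j = \sum_{i=1}^{n} Y_{ij}$.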
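The encoding above can be sketched in a few lines of Python. The sample below is made up for illustration, and the helper name `one_hot` is hypothetical; the point is only that summing the indicator vectors coordinate-wise yields the category counts.

```python
def one_hot(category, k):
    """Length-k indicator vector with a 1 in position `category` (1-based)."""
    return [1 if j == category else 0 for j in range(1, k + 1)]

# Hypothetical sample: n = 6 measurements, each falling into one of k = 3 categories.
sample = [1, 3, 3, 2, 3, 1]
k = 3

indicators = [one_hot(c, k) for c in sample]

# Summing the indicator vectors coordinate-wise gives the category counts.
counts = [sum(column) for column in zip(*indicators)]
print(counts)  # [2, 1, 3]
```

The counts always add up to the sample size, which is why the count vector has only $k - 1$ free coordinates.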

 

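Definition (multinomial distribution) Consider a random experiment such that

  1. the experiment consists of $n$ independent and identical trials;
  2. the outcome of each trial falls into exactly one of $k$ categories;
  3. the probability that a trial falls into category $j$ is $p_j$, and it stays the same from trial to trial.

Then the counts for each category ($X_1, \dots, X_k$) follow a multinomial distribution with the number of trials $n$ and cell probabilities ($p_1, \dots, p_k$), i.e., $(X_1, \dots, X_k) \sim \text{Multinomial}(n, p_1, \dots, p_k)$.

The pmf for $(X_1, \dots, X_k)$ is

$$P(X_1 = x_1, \dots, X_k = x_k) = \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k}, \qquad x_j \in \{0, 1, \dots, n\}, \quad \sum_{j=1}^{k} x_j = n,$$

where the cell probabilities sum up to $1$, i.e., $\sum_{j=1}^{k} p_j = 1$.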

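Remark 1: the binomial distribution is the special case of the multinomial distribution with $k = 2$, i.e., $(X_1, X_2) \sim \text{Multinomial}(n, p_1, 1 - p_1)$ is equivalent to $X_1 \sim \text{Binomial}(n, p_1)$.

Remark 2: Suppose $Y_1, \dots, Y_n$ are i.i.d. such that $Y_i \sim \text{Multinomial}(1, p_1, \dots, p_k)$. Then $\sum_{i=1}^{n} Y_i \sim \text{Multinomial}(n, p_1, \dots, p_k)$.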
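As a numerical check of the pmf and of Remark 1, here is a minimal sketch using only the Python standard library. The function name `multinomial_pmf` and all numbers are illustrative choices, not part of the notes.

```python
from itertools import product
from math import comb, factorial, prod

def multinomial_pmf(x, p):
    """P(X_1 = x_1, ..., X_k = x_k) for Multinomial(n, p), with n = sum(x)."""
    coef = factorial(sum(x))
    for xj in x:
        coef //= factorial(xj)  # the running quotient stays an integer
    return coef * prod(pj ** xj for pj, xj in zip(p, x))

# Remark 1: with k = 2 the pmf reduces to the Binomial(n, p_1) pmf.
n, p1, x1 = 10, 0.3, 4
binom = comb(n, x1) * p1 ** x1 * (1 - p1) ** (n - x1)
multi = multinomial_pmf([x1, n - x1], [p1, 1 - p1])

# The pmf sums to 1 over all count vectors with x_1 + x_2 + x_3 = 4.
p = [0.2, 0.5, 0.3]
total = sum(
    multinomial_pmf(x, p)
    for x in product(range(5), repeat=3)
    if sum(x) == 4
)
print(binom, multi, total)
```

Up to floating-point rounding, the first two printed values agree and the third equals 1.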

 

Using observed counts $(x_1, \dots, x_k)$ (i.e., a realization of $(X_1, \dots, X_k)$), we would like to make inferences about the category probabilities $(p_1, \dots, p_k)$.

 

Chi-Square Tests

Setting

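We have a random sample $Y_1, \dots, Y_n$ where each $Y_i \sim \text{Multinomial}(1, p_1, \dots, p_k)$. Equivalently, we have a random vector $X = (X_1, \dots, X_k)$ such that $X \sim \text{Multinomial}(n, p_1, \dots, p_k)$.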

 

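  1. Null and alternative hypotheses: $H_0: p_1 = p_{1,0}, \dots, p_k = p_{k,0}$ vs. $H_a: p_j \neq p_{j,0}$ for at least one $j$, where $p_{1,0}, \dots, p_{k,0}$ are specified probabilities that sum to $1$.
  2. Test statistic: $\chi^2 = \sum_{j=1}^{k} \dfrac{(X_j - n p_{j,0})^2}{n p_{j,0}}$.
  3. Null distribution: under $H_0$, $\chi^2$ approximately follows the $\chi^2_{k-1}$ distribution when $n$ is large.

  4. Given a significance level $\alpha$, the rejection region for the observed test statistic $\chi^2_{obs}$: $\{\chi^2_{obs} > \chi^2_{\alpha, k-1}\}$, where the observed test statistic is $\chi^2_{obs} = \sum_{j=1}^{k} \dfrac{(x_j - n p_{j,0})^2}{n p_{j,0}}$ and $\chi^2_{\alpha, k-1}$ is the upper-$\alpha$ quantile of the $\chi^2_{k-1}$ distribution.

    The test of "rejecting $H_0$ if the observed test statistic exceeds $\chi^2_{\alpha, k-1}$" is an approximate significance level $\alpha$ test.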
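The four steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the counts and null probabilities are made up, and `pearson_chi2` and `chi2_sf` are hypothetical helper names. Instead of looking up a critical value, the sketch computes the upper-tail probability (p-value) of the $\chi^2$ distribution with a standard series for the incomplete gamma function; rejecting when the p-value is below $\alpha$ is equivalent to rejecting when the observed statistic exceeds the upper-$\alpha$ quantile.

```python
import math

def pearson_chi2(observed, p0):
    """Pearson statistic: sum over cells of (observed - expected)^2 / expected."""
    n = sum(observed)
    expected = [n * p for p in p0]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf(x, df):
    """P(chi-square with df degrees of freedom > x), via the series for the
    regularized lower incomplete gamma function (fine for moderate x)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return max(0.0, 1.0 - lower)

# Hypothetical data: k = 3 cells, n = 120 observations.
observed = [30, 50, 40]
p0 = [0.25, 0.50, 0.25]   # null cell probabilities (illustrative)
alpha = 0.05              # illustrative significance level

stat = pearson_chi2(observed, p0)      # expected counts are 30, 60, 30
p_value = chi2_sf(stat, df=len(observed) - 1)
reject = p_value < alpha
print(stat, p_value, reject)
```

With these made-up counts the statistic is 5.0 on 2 degrees of freedom, and the null hypothesis is not rejected at the 5% level.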

 

Remark 1: this is an asymptotic test. The sample size $n$ needs to be large for the $\chi^2_{k-1}$ distribution to approximate the null distribution well. A rule of thumb is to check whether $n p_{j,0} \geq 5$ for $j = 1, \dots, k$. In practice, we check whether each observed count exceeds $5$ or not.

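Remark 2: this test has a connection with the asymptotic LRT. In fact, it can be shown that $-2 \log \Lambda \approx \chi^2$ when $n$ is large, where $\Lambda$ is the likelihood ratio statistic. In particular, the degrees of freedom of the $\chi^2$ distribution is the difference between the numbers of free parameters in the full and null parameter spaces, here $(k - 1) - 0 = k - 1$.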
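To make Remark 2 concrete, the sketch below compares the Pearson statistic with the likelihood ratio statistic on made-up counts. It uses the fact that, for a fully specified multinomial null, $-2 \log \Lambda = 2 \sum_{j} x_j \log\!\big(x_j / (n p_{j,0})\big)$, since the unrestricted ML estimates are $x_j / n$. The function names and data are illustrative.

```python
import math

def pearson_chi2(observed, p0):
    """Pearson statistic: sum over cells of (observed - expected)^2 / expected."""
    n = sum(observed)
    return sum((x - n * p) ** 2 / (n * p) for x, p in zip(observed, p0))

def lrt_statistic(observed, p0):
    """-2 log(Lambda) = 2 * sum_j x_j * log(x_j / (n * p_j0)).
    Cells with x_j = 0 contribute 0 to the sum."""
    n = sum(observed)
    return 2.0 * sum(
        x * math.log(x / (n * p)) for x, p in zip(observed, p0) if x > 0
    )

# Hypothetical counts and null probabilities.
observed = [30, 50, 40]
p0 = [0.25, 0.50, 0.25]

chi2_stat = pearson_chi2(observed, p0)
lrt_stat = lrt_statistic(observed, p0)
print(chi2_stat, lrt_stat)  # roughly 5.0 and 4.78
```

The two statistics are close but not identical; the approximation improves as $n$ grows with the cell proportions held fixed.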

 

Example: A group of rats, one by one, proceed down a ramp to one of three doors. We wish to test the hypothesis that the rats have no preference concerning the choice of a door, i.e., $H_0: p_1 = p_2 = p_3 = 1/3$. Suppose that the rats were sent down the ramp $n$ times and that the three observed cell frequencies were $x_1$, $x_2$, and $x_3$.
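A sketch of how this example could be worked in Python is given below. The counts and the significance level are hypothetical placeholders chosen for illustration, not the values of the example. With $k = 3$ doors the null distribution has 2 degrees of freedom, and for 2 degrees of freedom the upper-tail probability is $e^{-x/2}$, so the critical value is $-2 \log \alpha$.

```python
import math

# Hypothetical door counts (not the data of the example).
observed = [15, 25, 20]
n = sum(observed)
p0 = [1 / 3, 1 / 3, 1 / 3]   # "no preference" null hypothesis
alpha = 0.05                 # illustrative significance level

expected = [n * p for p in p0]
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# For df = 2, P(chi-square > x) = exp(-x / 2), so the upper-alpha
# quantile is -2 * log(alpha), about 5.991 when alpha = 0.05.
critical = -2.0 * math.log(alpha)
reject = stat > critical
print(round(stat, 4), round(critical, 4), reject)
```

For these made-up counts the statistic is about 2.5, below the critical value, so the no-preference hypothesis would not be rejected.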

Chi-square Goodness of Fit (GoF) test

The test above can be used to test whether sample data are from a given distribution or not.

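Idea: partition the range of the random variable $W$ into $k$ distinct regions ($A_1, \dots, A_k$) and compare observed and expected counts for each region. Under $H_0$, the cell probabilities are $p_{j,0} = P_{H_0}(W \in A_j)$.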

 

Example Let $W$ denote the number of heads that occur when four identical coins are tossed at random. Under the assumption that the four coins are independent and the probability of heads on each coin is $p_0$, $W$ is $\text{Binomial}(4, p_0)$. One hundred repetitions of this experiment resulted in $0, 1, 2, 3$, and $4$ heads being observed on $x_0, x_1, x_2, x_3$, and $x_4$ trials, respectively. Do these results support the assumptions? Make a conclusion at significance level $\alpha$.
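The expected counts for this kind of example come directly from the binomial pmf. The sketch below assumes, purely for illustration, fair coins (probability of heads $1/2$), made-up observed counts, and a 5% level. There are $k = 5$ cells, so the null distribution has 4 degrees of freedom, for which the upper-tail probability has the closed form $e^{-x/2}(1 + x/2)$.

```python
from math import comb, exp

n_trials = 100
n_coins = 4
p_heads = 0.5   # assumption: fair coins
alpha = 0.05    # assumption: illustrative significance level

# Null cell probabilities P(W = h), h = 0, ..., 4, under Binomial(4, p_heads).
probs = [
    comb(n_coins, h) * p_heads ** h * (1 - p_heads) ** (n_coins - h)
    for h in range(n_coins + 1)
]
expected = [n_trials * q for q in probs]   # [6.25, 25.0, 37.5, 25.0, 6.25]

# Hypothetical observed counts of trials with 0, 1, 2, 3, 4 heads.
observed = [5, 27, 38, 24, 6]

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Closed-form upper-tail probability, valid for df = 4 only.
p_value = exp(-stat / 2) * (1 + stat / 2)
reject = p_value < alpha
print(expected, round(stat, 4), df, reject)
```

All expected counts here are at least 5, so the rule of thumb for the chi-square approximation is met.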

Often, we are interested in testing whether the random variable of interest is from a certain family of distributions or not. Note that, in such a case, the distribution of $W$ is not completely specified under $H_0$.

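  1. Null and alternative hypotheses: $H_0$: the distribution of $W$ belongs to the family $\{F_\theta : \theta \in \Theta\}$ vs. $H_a$: it does not. In terms of the cell probabilities, $H_0: p_j = p_j(\theta)$, $j = 1, \dots, k$, for some $\theta \in \Theta$, where $p_j(\theta) = P_\theta(W \in A_j)$.
  2. Test statistic: $\chi^2 = \sum_{j=1}^{k} \dfrac{(X_j - n p_j(\hat\theta))^2}{n p_j(\hat\theta)}$,

    where $\hat\theta$ is an ML estimator of $\theta$ under $H_0$ (i.e., $\hat\theta = \arg\max_{\theta \in \Theta} L(\theta)$).

  3. Null distribution: under $H_0$, $\chi^2$ approximately follows the $\chi^2_{k-1-d}$ distribution when $n$ is large,

    where $d$ is the length of $\theta$ (i.e., the number of free parameters under $H_0$).

  4. Given a significance level $\alpha$, the rejection region for the observed test statistic $\chi^2_{obs}$: $\{\chi^2_{obs} > \chi^2_{\alpha, k-1-d}\}$, where the observed test statistic is $\chi^2_{obs} = \sum_{j=1}^{k} \dfrac{(x_j - n p_j(\hat\theta))^2}{n p_j(\hat\theta)}$ and $\chi^2_{\alpha, k-1-d}$ is the upper-$\alpha$ quantile of the $\chi^2_{k-1-d}$ distribution.

    The test of "rejecting $H_0$ if the observed test statistic exceeds $\chi^2_{\alpha, k-1-d}$" is an approximate significance level $\alpha$ test.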

     

Example The number of accidents per week at an intersection was checked for $n$ weeks. Out of the $n$ weeks, $x_0$ weeks had no accidents, $x_1$ weeks had one accident, and $x_2$ weeks had two accidents. Test the hypothesis that the random variable has a Poisson distribution, assuming the observations to be independent. Use significance level $\alpha$.
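A hedged sketch of the Poisson goodness-of-fit computation follows. The weekly counts and the significance level are hypothetical, not those of the example. Two modelling choices are assumptions of this sketch: $\lambda$ is estimated by the sample mean (the Poisson ML estimate from the raw data), and the last cell is taken to be "two or more accidents" so that the cell probabilities sum to 1. With $k = 3$ cells and $d = 1$ estimated parameter, the null distribution has $3 - 1 - 1 = 1$ degree of freedom, for which $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$.

```python
import math

# Hypothetical data: number of weeks with 0, 1, and 2 accidents.
observed = [60, 28, 12]
n = sum(observed)
alpha = 0.05   # illustrative significance level

# ML estimate of the Poisson mean: the average number of accidents per week.
lam = sum(count * weeks for count, weeks in enumerate(observed)) / n

# Cell probabilities under Poisson(lam) for the cells {0}, {1}, {2 or more}.
p_zero = math.exp(-lam)
p_one = lam * math.exp(-lam)
probs = [p_zero, p_one, 1.0 - p_zero - p_one]
expected = [n * p for p in probs]

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 1   # k - 1 - d with d = 1 estimated parameter

# Upper-tail probability of the chi-square distribution, valid for df = 1 only.
p_value = math.erfc(math.sqrt(stat / 2))
reject = p_value < alpha
print(round(lam, 2), round(stat, 4), df, reject)
```

With these made-up counts the Poisson hypothesis would not be rejected. Estimating $\lambda$ costs one degree of freedom relative to the fully specified test.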

Chi-square tests of independence

Two-way contingency table (a table for two categorical variables)

Example:

A statistics 415 instructor wants to know if there is a relationship between favorite color (red or yellow) and the preferred condiment on a corn dog. The following table summarizes the results.

              Condiment
Color     Ketchup   Mustard   Total
Red
Yellow
Total

 

A general example of our contingency table with two classifying factors can be displayed as follows.

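           Column 1        Column 2        ...   Column c        Total
Row 1      $X_{11}$        $X_{12}$        ...   $X_{1c}$        $X_{1\cdot}$
Row 2      $X_{21}$        $X_{22}$        ...   $X_{2c}$        $X_{2\cdot}$
...        ...             ...             ...   ...             ...
Row r      $X_{r1}$        $X_{r2}$        ...   $X_{rc}$        $X_{r\cdot}$
Total      $X_{\cdot 1}$   $X_{\cdot 2}$   ...   $X_{\cdot c}$   $n$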

 

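In some problems, the counts $X_{ij}$, $i = 1, \dots, r$, $j = 1, \dots, c$, can be modeled using a multinomial distribution: $(X_{11}, \dots, X_{rc}) \sim \text{Multinomial}(n, p_{11}, \dots, p_{rc})$, where $p_{ij}$ is the probability that a single observation falls in row $i$ and column $j$.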

 

Chi-square tests of independence

Chi-square tests of independence aim to answer the question: for a single observation, is the row assignment statistically independent of the column assignment?

 

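  1. In terms of hypotheses, we will test

$H_0: P(\text{row } i, \text{column } j) = P(\text{row } i) \, P(\text{column } j)$, $i = 1, \dots, r$, $j = 1, \dots, c$,

vs.

$H_a: P(\text{row } i, \text{column } j) \neq P(\text{row } i) \, P(\text{column } j)$ for at least 1 pair $(i, j)$.

 

Defining

$p_{ij} = P(\text{row } i, \text{column } j)$, $p_{i\cdot} = \sum_{j=1}^{c} p_{ij}$, $p_{\cdot j} = \sum_{i=1}^{r} p_{ij}$,

we can rewrite $H_0$ and $H_a$ as

$H_0: p_{ij} = p_{i\cdot} \, p_{\cdot j}$, $i = 1, \dots, r$, $j = 1, \dots, c$,

$H_a: p_{ij} \neq p_{i\cdot} \, p_{\cdot j}$ for at least 1 pair $(i, j)$.

 

  2. Test statistic:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(X_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}},$$

where

$$\hat{E}_{ij} = n \, \hat{p}_{i\cdot} \, \hat{p}_{\cdot j} = \frac{X_{i\cdot} \, X_{\cdot j}}{n}.$$

Here $X_{ij}$ is the count in row $i$ and column $j$, $X_{i\cdot}$ and $X_{\cdot j}$ are the row and column totals, $n$ is the grand total, and $\hat{p}_{i\cdot} = X_{i\cdot}/n$ and $\hat{p}_{\cdot j} = X_{\cdot j}/n$ are the ML estimators under $H_0$.

 

  3. Under $H_0$, $\chi^2$ approximately follows a $\chi^2$ distribution with $(r-1)(c-1)$ degrees of freedom; that is, $\chi^2 \approx \chi^2_{(r-1)(c-1)}$. The degrees of freedom are $(rc - 1) - [(r - 1) + (c - 1)] = (r-1)(c-1)$, the difference between the numbers of free parameters in the full and null parameter spaces.

 

  4. Given a significance level $\alpha$, the rejection region for the observed test statistic $\chi^2_{obs}$: $\{\chi^2_{obs} > \chi^2_{\alpha, (r-1)(c-1)}\}$, where $\chi^2_{\alpha, (r-1)(c-1)}$ is the upper-$\alpha$ quantile of the $\chi^2_{(r-1)(c-1)}$ distribution.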
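The procedure above can be sketched in Python for a general $r \times c$ table. The function name `independence_test`, the $2 \times 2$ table, and the 5% level are illustrative assumptions. For a $2 \times 2$ table the null distribution has 1 degree of freedom, so the upper-tail probability is $\operatorname{erfc}(\sqrt{x/2})$; no continuity correction is applied.

```python
import math

def independence_test(table):
    """Return (statistic, degrees of freedom) for the chi-square test of
    independence on an r x c table of counts (a list of rows)."""
    r, c = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i in range(r):
        for j in range(c):
            # Estimated expected count under independence.
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat, (r - 1) * (c - 1)

# Hypothetical 2 x 2 table of counts.
table = [[30, 20],
         [20, 30]]
alpha = 0.05   # illustrative significance level

stat, df = independence_test(table)

# Upper-tail probability of the chi-square distribution, valid for df = 1 only.
p_value = math.erfc(math.sqrt(stat / 2))
reject = p_value < alpha
print(stat, df, round(p_value, 4), reject)
```

With these made-up counts every estimated expected count is 25 and the statistic is 4.0, which exceeds the 5% critical value 3.841 for 1 degree of freedom, so independence would be rejected.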

Example:

Test whether there is a relationship between favorite color (red or yellow) and the preferred condiment on a corn dog at significance level $\alpha$.

              Condiment
Color     Ketchup   Mustard   Total
Red
Yellow
Total